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Abstract 

This  work  outlines  a  unified  multi-threaded,  multi-scale  High  Performance  Computing  (HPC) 
approach  for  the  direct  numerical  simulation  of  Fluid-Solid  Interaction  (FSI)  problems.  The 
simulation  algorithm  relies  on  the  extended  Smoothed  Particle  Hydrodynamics  (XSPH) 
method,  which  approaches  the  fluid  flow  In  a  Lagranglan  framework  consistent  with  the 
Lagrangian  tracking  of  the  solid  phase.  A  general  3D  rigid  body  dynamics  and  an  Absolute 
Nodal  Coordinate  Formulation  (ANCF)  are  Implemented  to  model  rigid  and  flexible  multi¬ 
body  dynamics.  The  two-way  coupling  of  the  fluid  and  solid  phases  is  supported  through 
use  of  Boundary  Condition  Enforcing  (BCE)  markers  that  capture  the  fluid-solid  coupling 
forces  by  enforcing  a  no-slip  boundary  condition.  The  solid-solid  short  range  interaction, 
which  has  a  crucial  Impact  on  the  small-scale  behavior  of  fluid-solid  mixtures,  is  resolved 
via  a  lubrication  force  model.  The  collective  system  states  are  integrated  in  time  using  an 
explicit,  multi-rate  scheme.  To  alleviate  the  heavy  computational  load,  the  overall  algorithm 
leverages  parallel  computing  on  Graphics  Processing  Unit  (GPU)  cards.  Performance  and 
scaling  analysis  are  provided  for  simulations  scenarios  involving  one  or  multiple  phases  with 
up  to  tens  of  thousands  of  solid  objects.  The  software  Implementation  of  the  approach,  called 
Chrono::Fluld,  is  part  of  the  Chrono  project  and  available  as  an  open-source  software. 
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1.  Introduction 


The  last  decade  witnessed  a  paradigm  shift  In  the  microproces¬ 
sor  Industry  towards  chip  designs  that  provide  strong  support 
for  parallel  computing.  Teraflop  computing,  i.e.  computing  at 
the  rate  of  1012  FLoatlng-polnt  Operation  Per  Second,  until  re¬ 
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cently  the  privilege  of  a  select  group  of  large  research  centers,  is 
becoming  a  commodity  due  to  inexpensive  GPU  cards  and  multi- 
to  many-core  x86  processors.  The  steady  improvements  in  pro¬ 
cessor  speed  and  architecture,  memory  layout  and  capacity,  and 
power  efficiency  have  motivated  a  trend  of  re-evaluating  simu¬ 
lation  algorithms  with  an  eye  towards  identifying  solutions  that 
map  well  onto  these  new  hardware  platforms.  In  this  context,  a 
scrutinizing  of  the  existing  Fluid-Solid  Interaction  (FSI)  solu¬ 
tions  reveals  that  they  are  mostly  Inadeguate  since  they  either 
rely  on  algorithms  that  do  not  map  well  on  emerging  computer 
architectures,  or  they  are  unable  to  capture  phenomena  of  In¬ 
terest  which  may  regulre  the  simulation  of  tens  of  thousands  of 


Preprint  submitted  to  Archive  of  Mechanical  Engineering 


rigid  and  deformable  bodies  that  interact  directly  or  through  the 
fluid  media. 

FSI  problems  are  usually  simulated  In  either  an  Eulerlan- 
Lagranglan  framework,  where  the  Lagranglan  solid  phase  moves 
wlth/over  the  Eulerlan  grid  used  for  fluid  discretization,  or  an 
Eulerlan-Eulerlan  framework,  where  the  solid  phase  is  consid¬ 
ered  as  an  average  of  ensembles  Instead  of  a  specific  state. 
While  the  former  approach  delivers  the  flexibility  required  by 
general  multibody  dynamics  problems,  It  does  not  provide  a  level 
of  performance  suitable  for  the  simulation  of  dense  fluid-solid 
problems  such  as  suspensions.  This  can  be  traced  back  to  the 
costly  processes  that  overlaps  the  two  different  representations 
of  the  fluid  and  solid  phases. 

The  Lagranglan  representation  of  the  fluid  flow  dovetails  well 
with  the  Lagranglan  approach  used  for  the  motion  of  the  solid 
phase.  This  approach  has  recently  been  employed  In  the  context 
of  multibody  dynamics  and  particle  suspension  In  [12,  25,  29- 
31,  33].  Specifically,  the  FSI  methodology  relying  on  Smoothed 
Particle  Hydrodynamics  (SPH)  [16,  19,  22]  and  rigid  body 
Newton-Euler  equation  of  motlon’fTD]  was  discussed  in  detail 
In  [29,  30].  This  or  a  similar  framework  was  used  for  the  In¬ 
vestigation  and  validation  of  rigid  body  suspensions  [30],  as 
well  as  flexible  beams  Interacting  with  fluid  [12,  31,  33].  Addi¬ 
tionally,  support  for  the  short-range  solld-solicPinteractlon  was 
Introduced  In  [30,  31],  which  was  leveraged  In  the  simulation  of 
dense  suspensions. 

This  contribution  concentrates  on  the  high  performance  com¬ 
puting  approach  required  for  the  efficient  simulation  of  FSI 
problems.  Although  the  approach  and  algorithms  explained 
herein  can  leverage  any  multl-threaded  architecture,  the  CUDA 
library  [26]  was  employed  for  the  execution  of  all  solution 
components  on  the  CPU,  with  negligible  data  transfer  between 
the  host  (CPU)  and  the  device  (CPU).  The  paper  is  organized 
as  follows.  The  numerical  methods  adopted  for  the  simulation 
of  fluid  and  solid  phases,  fluid-solid  coupling,  and  solid-solid 
short  range  Interaction  are  explained  In  Sect.  2.  The  com¬ 
putational  scheme,  Including  the  time  Integration  algorithm, 
HPC-based  Implementation,  and  parallel  neighbor  search 
required  for  SPH  are  described  In  Sect.  3.  Section  4  contains  a 
performance  analysis  of  the  algorithm  using  Kepler-type  GPU 
cards. 


2.  Numerical  approach 

The  proposed  FSI  simulation  framework  combines  SPH  for  the 
simulation  of  the  fluid  flow,  the  Newton-Euler  formalism  for  rigid 
body  dynamics,  and  ANCF  for  modeling  flexible  body  dynam¬ 
ics.  These  algorithmic  components  are  described  in  more  detail 


In  the  following  subsections,  including  a  discussion  on  the  ap¬ 
proach  adopted  for  resolving  fluid-solid  and  solid-solid  interac¬ 
tions. 


2.1.  Smoothed  Particle  Hydrodynamics 

The  term  smoothed  in  SPH  refers  to  the  approximation  of  point 
properties  via  a  smoothing  kernel  function  W,  defined  over  a 
support  domain  S.  This  approximation  reproduces  functions  with 
up  to  2nd  order  of  accuracy,  provided  the  kernel  function:  (1) 
approaches  the  Dirac  delta  function  as  the  size  of  the  support 

domain  tends  to  zero,  that  is  lim  Wlr,h)  =  <5(r),  where  r  is  the 

h^O 

spatial  distance  and  h  is  a  characteristic  length  that  defines  the 
kernel  smoothness;  (2)  Is  symmetric,  i.e.,  W(r,  h)  =  W(— r,  h ); 
and  (3)  is  normal,  l.e.,  fs  W(r,  /?)dV  =  1,  where  dV  denote 
the  differential  volume.  A  typical  spatial  function  f(x)  Is  then 
approximated  by  {f(x))  as 

f(x)  =  J  f(x')5(x  -  x')d¥ 

s 

=  J  f(x)W(x-x,h)dV+0(h2)  f1) 

s 

=  (/(x)>  +  0(h2)  . 


To  simplify  notation,  in  the  remainder  of  this  document  we  use 
f(x)  to  represent  (f(x)).  Using  Eq.  (1)  and  the  divergence  theo¬ 
rem,  the  spatial  derivatives  of  a  function  can  be  mapped  to  the 
derivatives  of  the  kernel  function.  For  instance,  the  gradient  of 
a  function  can  be  written  as 


Vf(x)  =  J  f(x')W(x  —  x',  h)ndA 

dS 

-  J  f(x')VH/(x  —  x',  h)dY  , 

s 


(2) 


where  dS  is  the  boundary  of  the  integration  volume  V  and  dA 
is  the  differential  area.  By  Imposing  an  additional  property  for 
the  kernel  function,  namely  that  (4)  It  approaches  zero  as  r  In¬ 
creases,  i.e.,  lim  W(r,  h)  =  0,  the  first  term  on  the  right  hand 

r— »oo 

side  of  Eq.  (2)  vanishes.  Note  that  some  additional  considera¬ 
tions,  which  will  be  addressed  later,  are  required  for  the  SPH 
approximation  near  boundaries. 

The  term  particie  in  SPH  terminology  Indicates  the  discretiza¬ 
tion  of  the  domain  by  a  set  of  Lagranglan  particles.  To  remove 
the  ambiguity  caused  by  the  use  of  the  term  rigid  particles  in 
the  context  of  FSI  problems,  we  use  the  term  marker  to  refer 
to  the  SPH  discretization  unit.  Each  marker  has  mass  m  as¬ 
sociated  to  the  representative  volume  dV  and  carries  all  of  the 
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essential  field  properties.  As  a  result,  any  field  property  at  a 
certain  location  is  shared  and  represented  by  the  markers  in 
the  vicinity  of  that  location.  The  value  of  a  certain  unknown 
at  a  given  location  is  calculated  according  to  the  distance  be¬ 
tween  that  location  and  the  collection  of  markers  in  its  vicinity 
using  the  field  values  at  these  markers  and  the  expression  of 
the  kernel  function  W.  This  leads  for  the  second  approximation 
embedded  in  SPH,  which  can  be  expressed  as 


f  (x)  =  j  w(x  ~  *■  h)p(x')dV 

s 

-  Y~  —  f(xb)W(x  -  xbl  h )  , 

V  p» 


(3) 


2.1.1.  SPH  for  fluid  dynamics 

For  fluid  dynamics,  the  continuity  and  momentum  eguations  are 
given  in  Lagrangian  form  as 

^  =  " pVv  (5) 

and 

^  = -lvp+ -V2v  +  f  ,  (6) 

dt  p  p 

where  p  is  the  fluid  viscosity,  while  v  and  p  are  the  flow  ve¬ 
locity  and  pressure,  respectively.  Within  the  SPH  framework 
described  earlier,  Egs.  (5)  and  (6)  are  discretized  at  an  arbi¬ 
trary  location  x  =  xa  within  the  fluid  domain  as  [24] 


where  b  is  the  marker  index  and  pb  is  the  fluid  density,  smoothed 
at  the  marker  location  xb.  The  summation  in  Eg.  (3)  is  over  all 
markers  whose  support  domain  overlaps  the  location  x.  Several 
other  properties  of  the  kernel  functions  are  provided  in  [16].  For 
instance,  the  kernel  function  should  be  a  positive  and  monoton- 
ically  decreasing  function  of  r,  which  implies  that  the  influence 
of  distant  markers  on  field  properties  at  a  given  location  is  less 
than  that  of  nearby  markers.  Moreover,  for  acceptable  com¬ 
putational  performance  and  to  avoid  a  guadratic  computational 
complexity,  kernel  functions  have  a  compact  domain  of  influence 
with  a  radius  Rs  defined  as  some  finite  multiple  k  of  the  char¬ 
acteristic  length  h,  as  shown  in  Figure  1.  The  methodology 
adopted  herein  relies  on  a  cubic  spline, 


Po 


dpo 

dt 


Pc,  y~  —  (va  -  Vb)  -Vc  W„b  , 

b  Pb 


(7) 


and 


b 


Po  Pb 

Pa2  Pb2 


v  a  wab  +  ru 


+  f„.  (8) 


In  the  above  eguations,  guantities  with  subscripts  a  and  b  are 
associated  with  markers  a  and  b  (see  Figure  1),  respectively.  It 
is  important  to  note  that  these  guantities  are  different  from  the 
corresponding  physical  guantities  at  locations  x„  and  xb.  The 
viscosity  term  fU  is  defined  as 


W(q,h) 


1 

4jrh3 


x  - 


(2  -  q)3  ~  4(1 
(2  ~q)3, 


0, 


q)3,  0  <  q  <  1 

1  <  q  <  2  ,  (4) 

Q>  2 


where  q  =  |r|  //?,  has  a  support  domain  with  radius  2 h. 


Figure  1 .  Illustration  of  the  kernel  W  and  its  support  domain  S.  SPH  markers 
are  shown  as  black  dots.  For  2D  problems  the  support  domain  is  a 
circle,  while  for  3D  problems  it  is  a  sphere. 


ru 


(^o  "I"  Vb)Xab'^aWat) 

Tt  t  ^  ab 

Pab(Xab  +  chob) 


(9) 


where  xab  =  x„  —  xb,  Wab  =  W(xab,  h ),  V„  is  the  gradient  with 
respect  to  x„,  i.e.,  d/dxa,  and  c  is  a  regularization  coefficient. 
Quantities  with  an  over-bar  are  averages  of  the  corresponding 
guantities  for  markers  a  and  b. 

Alternative  viscosity  discretizations  include: 

1.  the  model  suggested  in  [5]: 

no6  =  -PoPbXob-Vabl{PaPb(Pa  +  Pb){*2b  +  eh2abj\W aWob  \ 

2.  direct  discretization  of  V2  operator  [22,  24];  and 

3.  the  class  of  artificial  viscosity  models  introduced  in  [17, 
1 9,  23] 

However,  using  Eg.  9  is  preferred  since  it  has  the  following  prop¬ 
erties:  (1)  it  ensures  that  the  viscous  force  is  along  the  shear 
direction  \ab,  instead  of  the  particles  center  line  xab,  (2)  it  is 
less  sensitive  to  local  velocities;  (3)  it  allows  for  better  compu¬ 
tational  efficiency  by  removing  the  nested  loop  reguired  for  the 
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computation  of  the  V2  operator;  and  (4)  it  is  stated  in  terms  of 
physical  properties,  rather  than  model  parameters  like  those  in 
artificial  viscosity,  which  are  introduced  primarily  for  numerical 
stabilization  through  damping.  In  the  simulation  of  transient 
Poiseuille  flow,  although  we  managed  to  virtually  obtain  exact 
results  using  Eg.  9  [30],  the  error  caused  by  implementing  either 
n*fc  or  artificial  viscosity  were  non-negligible. 

In  the  weakly  compressible  SPH  model,  the  pressure  p  is  eval¬ 
uated  using  an  eguation  of  state  [22] 


where 

A  v0  =  (v„-Vo)M/o/),  (14) 

b  p°b 

and  0  <  £  <  1  adjusts  the  contribution  of  the  neighbors'  veloci¬ 
ties.  All  the  results  reported  herein  were  obtained  with  (  =  0.5. 
The  modified  velocity  calculated  from  Eq.  (13)  replaces  the  orig¬ 
inal  velocity  in  the  density  and  position  update  equations,  but 
not  in  the  momentum  equation  [6]. 


P  = 


(10) 


where  po  is  the  reference  density  of  the  fluid,  y  tunes  the  stiff¬ 
ness  of  the  pressure-density  relationship,  and  cs  is  the  speed 
of  sound.  The  value  cs  is  adjusted  depending  on  the  maximum 
speed  of  the  flow,  VmM,  to  keep  the  flow  compressibility  below 
some  arbitrary  value.  We  use  y  =  7  and  cs  =  10\/max,  which 
allows  1%  flow  compressibility  [22]. 

The  fluid  flow  equations  (7)  and  (8)  are  solved  together  with 


x 


a 


d*a 

dt 


(11) 


to  update  the  position  of  the  SPH  markers. 

The  original  SPH  summation  formula  calculates  the  density  ac¬ 
cording  to 

Pa=^mbWab.  (12) 

b 

Equation  (7),  which  evaluates  the  time  derivative  of  the  density, 
was  preferred  to  the  above  since  it  produces  a  smooth  density 
field,  works  well  for  markers  close  to  the  boundaries  (the  free 
surface,  solid,  and  wall),  and  does  not  exhibit  the  large  varia¬ 
tions  in  the  density  field  introduced  when  using  Eq.  (12)  close 
to  the  boundaries.  However,  Eq.  (7)  does  not  guarantee  consis¬ 
tency  between  a  marker's  density  and  the  associated  mass  and 
volume  [3,  21,  24].  The  so-called  density  re-initialization  tech¬ 
nique  [6]  attempts  to  address  this  issue  by  using  Eq.  (7)  at  each 
time  step  and  periodically;  i.e.,  every  n  time  steps,  use  Eq.  (12) 
to  correct  the  mass-density  inconsistency.  The  results  reported 
herein  were  obtained  with  n  =  10.  The  Moving  Least  Squares 
method  or  a  normalized  version  of  Eq.  (12)  could  alternatively 
be  used  to  address  the  aforementioned  issues,  see  [6,  7]. 
Finally,  the  methodology  proposed  employs  the  extended  SPH 
approach  (XSPH),  which  prevents  extensive  overlap  of  markers' 
support  domain  and  enhances  flow  incompressibility  [20].  This 
correction  takes  into  account  the  velocity  of  neighboring  mark¬ 
ers  through  a  mean  velocity  evaluated  within  the  support  of  a 
nominal  marker  a  as 


v„  =  v„  +  Av0 


(13) 


2.2.  Rigid  body  dynamics 

Once  the  fluid-solid  interactions  between  individual  markers; 
i.e.,  the  right  hand  side  of  Eqs.  (7)  and  (8),  are  accounted  for, 
the  total  rigid  body  force  and  torque  due  to  the  interaction  with 
the  fluid  can  be  obtained  by  respectively  summing  the  individ¬ 
ual  forces  and  their  induced  torques  over  the  entire  rigid  body. 
They  are  then  added  to  any  other  forces,  including  external  and 
contact  forces.  The  dynamics  of  the  rigid  bodies  is  fully  char¬ 
acterized  by  the  Newton-Euler  equations  of  motion  (EOM),  see 
for  instance  [10]: 


TV,  F, 

~dt  ~  AT] ' 

(15) 

> 

II 

si* 

(16) 

=  J'”1  |t[  —  w'J'w'j  , 

(17) 

^i  =  lc 

dt  2 

(18) 

q  J  q;  - 1  =  o, 

(19) 

where  F, ,  T',  Xi:  Vi:  <a'  £  R3  denote  the  force,  torque,  po¬ 
sition,  velocity,  and  angular  velocity  associated  to  body  i, 
i  =  1,2 ,...,nb,  respectively.  The  quantity  q,  denotes  the 
rotation  quaternion,  while  AT,  and  J'  are  the  mass  and  mo¬ 
ment  of  inertia,  respectively.  Quantities  with  a  prime  symbol 
are  represented  in  the  rigid  body  local  reference  frame.  Given 
a  =  [ax,  aLJ,  az ]  £  R3  and  q  =  [qxi  qy,  qz,  qw]  £  R4,  the 

auxiliary  matrices  a  and  G  are  defined  as  [10] 


0 

~az 

ay 

~Py 

<7* 

<7  w 

-qz 

az 

0 

-ax 

and  G  = 

~Pz 

q  w 

qz 

qy 

_  a  y 

ax 

0 

q  w 

qz 

-qy 

qz  _ 

(20) 
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2.3.  Flexible  body  dynamics 

The  ANCF  formulatLon  [36],  whLch  allows  for  large  deforma¬ 
tions  and  large  rigid  body  rotations,  is  adopted  herein  for  the 
simulation  of  flexible  bodies  suspended  In  the  fluid.  While  ex¬ 
tension  to  other  elastic  elements  Is  straightforward,  the  current 
Chrono::Fluld  Implementation  only  supports  gradient  deficient 
ANCF  beam  elements,  which  are  used  to  model  slender  flexible 
bodies  composed  of  ne  adjacent  ANCF  beam  elements.  The  flex¬ 
ible  bodies  are  modeled  using  a  number  n„  =  ne  + 1  of  egually- 
spaced  node  beam  elements,  each  represented  by  6  coordinates, 
e;-  =  [r J  ,  rJx]T ,  j  =  0,1,...,  ne;  i.e.,  the  three  components  of 
the  global  position  vector  of  the  node,  and  the  three  components 
of  its  slope.  This  is  therefore  eguivalent  to  a  model  using  ne 
ANCF  beam  elements  with  6  x  n„  continuity  constraints,  but 
is  more  efficient  in  that  it  uses  a  minimal  set  of  coordinates. 
We  note  that  formulations  using  gradient  deficient  ANCF  beam 
elements  display  no  shear  locking  problems  [9,  34,  35]  and,  due 
to  the  reduced  number  of  nodal  coordinates,  are  more  efficient 
than  fully  parametrized  ANCF  elements.  However,  gradient  de¬ 
ficient  ANCF  beam  elements  cannot  describe  a  rotation  about 
its  axis  and  therefore  cannot  model  torsional  effects. 

Consider  first  a  single  ANCF  beam  element  of  length  T.  The 
global  position  vector  of  an  arbitrary  point  on  the  beam  center 
line,  specified  through  its  element  spatial  coordinate  0  <  x  <  £, 
is  then  obtained  as 


The  generalized  element  elastic  forces  are  obtained  from  the 
strain  energy  expression  [36]  as 

v  =  LEAc"[^),d’‘+lE"‘ [%)'“'•■  1251 

where  e-ii  =  (rjr*  —  1 )  /2  is  the  axial  strain  and  k  =  rx  x  rxx/rx3 
is  the  magnitude  of  the  curvature  vector.  The  reguired  deriva¬ 
tives  of  the  position  vector  r  can  be  easily  obtained  from 
Eg.  (21)  in  terms  of  the  derivatives  of  the  shape  functions  as 
rx(x,  e)  =  S„(x)e  and  rxx(x,  e)  =  SYX(x)e. 

External  applied  forces,  in  particular  the  forces  due  to  the  inter¬ 
action  with  the  fluid,  see  Sect.  2.4,  are  included  as  concentrated 
forces  at  a  BCE  marker.  The  corresponding  generalized  forces 
are  obtained  from  the  expression  of  the  virtual  work  as 

Q°  =  Sr(x„)F  ,  (26) 

where  F  is  the  external  point  force  and  the  shape  function  matrix 
is  evaluated  at  the  projection  onto  the  element's  center  line  of 
the  force  application  point.  The  generalized  gravitational  force 
can  be  computed  as 

Qs  =  J^psASTgdx.  (27) 


r(x,  e)  =  S(x)e  ,  (21 ) 

where  e  =  [ej  ,  ej]r  e  R12  is  the  vector  of  element  nodal 
coordinates.  With  I  being  the  3x3  identity  matrix,  the  shape 
function  matrix  S  =  [Si I  S2I  S3I  S4I]  £  R3x12  is  defined 
using  the  shape  functions  [36] 

St  =  1  —  3^2  -F  2<F 

52  =  C[Z-2e  +  e) 

53  =  3<f2  —  2<J3  [  ’ 

s4  =  ^K2  +  <F)  , 

where  =  x/T  £  [0, 1],  The  element  EOM  are  then  written  as 
Me  +  Q"  =  Q°  ,  (23) 


In  the  above  expressions,  ps  represents  the  element  mass  den¬ 
sity,  A  is  the  cross  section  area,  E  is  the  modulus  of  elasticity, 
and  /  is  the  second  moment  of  area. 

The  EOM  for  a  slender  flexible  body  composed  of  ne  ANCF 
beam  elements  are  obtained  by  assembling  the  elemental  EOMs 
of  Eg.  (23)  and  taking  into  consideration  that  adjacent  beam  el¬ 
ements  share  6  nodal  coordinates.  Let  e  =  [ej  ,  ej  ,  ...  e„  ]r 
be  the  set  of  independent  nodal  coordinates;  then  the  nodal  co¬ 
ordinates  for  the  y-th  element  can  be  written  using  the  mapping 


e/ 

=  B,e ,  with  B,  = 

0 

0  . 

..I3  0  . 

.  .0 

er 

J  '  1 

j 

0 

0  . 

.  0  l3  . 

.  .0 

and  the  assembled  EOMs  are  obtained,  from  the  principle  of 
virtual  work,  as  follows.  Denoting  by  M;  be  the  element  mass 
matrix  of  Eg.  (24)  for  the  y-th  ANCF  beam  element,  it  can  be 
written  in  block  form  as 


where  Qe  and  Q°  are  the  generalized  element  elastic  and  ap¬ 
plied  forces,  respectively,  and  M  £  R12x12  is  the  symmetric 
consistent  element  mass  matrix  defined  as 


M,= 


1 /.« 

'j.rl 


Mj.iT 

Mj.rr 


(29) 


M  = 


l 


psASTSdx . 


(24) 


where  My,/r  =  Mjrl  and  all  sub-blocks  have  dimension  6x6. 
Here,  i  denotes  the  left  end  of  the  beam  element,  i.e.,  the  node 
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characterized  by  the  nodal  coordinates  ey_i,  while  r  corresponds 
to  the  node  with  coordinates  e;.  With  a  similar  decomposition 
of  a  generalized  element  force  into 


Q/  = 


Qp 

Qj.r 


(30) 


one  obtains 

MS  =  Q“  -  Qe  (31) 

where 


M 


Mi,//  M  iir 

M i,r/  M!  rr  +  M 2,11 


M2,„ 


M„ 


LQi,/ 

Qi,/ 

LQl  +  EQh 

Ql,r  +  Q  2,1 

Q°  —  Qe  = 

LQlr  +  LQh 

- 

Q  !.r  +  Q  3,/ 

.  EQLr  . 

QLr  . 

(32) 


(33) 


Inclusion  of  additional  kinematic  constraints,  e.g.,  anchoring  the 
beam  at  one  end  to  obtain  a  flexible  cantilever  or  fixing  its 
position  to  obtain  a  flexible  pendulum,  can  be  done  either  by 
formulating  the  EOM  as  differential-algebraic  eguations  or  by 
deriving  an  underlying  ODE  after  explicitly  eliminating  the  cor¬ 
responding  constrained  nodal  coordinates.  The  latter  approach 
was  used  in  ail  simulations  involving  flexible  cantilevers  dis¬ 
cussed  in  Sect.  4. 


2.4.  Fluid-solid  interaction 

The  two-way  fluid-solid  coupling  was  implemented  based  on  a 
methodology  described  in  [29],  The  state  update  of  any  SPH 
marker  relies  on  the  properties  of  its  neighbors  and  resolves 
shear  as  well  as  normal  inter-marker  forces.  For  the  SPH  mark¬ 
ers  close  to  solid  surfaces,  the  SPH  summations  presented  in 
Egs.  (7),  (8),  (1 2),  and  (1 4)  capture  the  contribution  of  fluid  mark¬ 
ers.  The  contribution  of  solid  objects  is  calculated  via  Boundary 
Condition  Enforcing  (BCE)  markers  placed  on  and  close  to  the 
solid  surface  as  shown  in  Figure  2.  The  velocity  of  a  BCE  marker 
is  obtained  from  the  rigid/deformable  body  motion  of  the  solid 
thus  ensuring  the  no-slip  condition  on  the  solid  surface.  Includ¬ 
ing  BCE  markers  in  the  SPH  summation  eguations  (7)  and  (8) 
results  in  fluid-solid  interaction  forces  that  are  added  to  both 
fluid  and  solid  markers. 


Figure  2.  Fluid-solid  interaction  using  BCE  markers  attached  to  a  rigid  bodg. 

BCE  and  fluid  markers  are  represented  by  black  and  white  circles, 
respectively.  The  BCE  markers  positioned  In  the  Interior  of  the  body 
(markers  g  and  f  In  the  figure)  should  be  placed  to  a  depth  less 
than  or  equal  to  the  size  of  the  compact  support  associated  with  the 
kernel  function  W.  A  similar  concept  yet  with  different  kinematics  is 
used  for  flexible  bodies. 


2.5.  Solid-solid  short  range  interaction 

Dry  friction  models  typically  used  to  characterize  the  dynamics 
of  granular  materials  [1,  13,  14]  do  not  capture  accurately  the 
impact  of  solid  surfaces  irTTiydrodynamics  media.  In  practice, 
it  is  infeasible  to  resolve  the  short-range,  high-intensity  impact 
forces  arising  in  wet  media  due  to  computational  limits  on  space 
and  time  resolution.  Ladd  [15]  proposed  a  normal  lubrication 
force  between  two  spheres  that  increases  rapidly  as  the  distance 
between  particles  approaches  zero  and  therefore  prevents  the 
actual  touching  of  the  spheres: 


p/u/j 

r'7 


min 


OiOj 

o,  4-  a j 


(34) 


where,  o,  and  Cy  are  the  sphere  radii,  \nij  is  the  normal  com¬ 
ponent  of  the  relative  velocity,  and  s  is  the  distance  between 
surfaces.  For  s  >  Ac,  F[yft  =  0,  and  the  spheres  are  subject 
only  to  hydrodynamic  forces. 

Eguation  (34)  provides  a  basic  model  for  the  estimation  of  the 
lubrication  force  in  normal  direction.  The  generalization  of  this 
model  to  non-spherical  objects  reguires  the  calculation  of  the 
minimum  distance  and  the  curvature  of  the  two  contact  surfaces. 
The  calculation  of  the  partial  lubrication  force  between  non- 
spherical  surfaces  follows  the  approach  proposed  in  [8]  for  a 
lattice  Boltzmann  method  but  is  amended  to  fit  the  Lagrangian 
formulation  adopted  herein.  Accordingly,  the  force  model  pro¬ 
vided  in  Eg.  (34)  is  modified  as 

p lub  _  Y  ck 

u  —  Z_  V 

k 

with  f fj  =  min  ■  - 


-7T/7/7-  I  -  -  I  ,0 


(35) 
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where  s*  and  v*  denote  the  markers'  relative  distance  and  ve- 

n‘i 

loclty,  respectively,  and  the  summation  is  over  all  interacting 
markers  of  the  two  solid  objects. 

3.  HPC  Implementation 

Chrono::Fluid  [4],  an  open-source  simulation  framework  for  fluid- 
solid  interaction,  provides  an  implementation  that  executes  all 
phases  of  the  solution  method  on  the  CPU.  A  brief  overview 
of  the  CPU  hardware  and  programming  model  adopted  is  fol¬ 
lowed  below  by  a  detailed  discussion  of  the  critical  kernels  that 
implement  the  proposed  modeling  and  solution  approach. 

3.1.  CPU  hardware  and  programming  model 

To  a  very  large  extent,  the  performance  of  today's  simulation 
engines  is  dictated  by  the  memory  bandwidth  of  the  hardware 
solution  adopted.  Recent  numerical  experiments  conducted  by 
the  authors  revealed  that  only  about  5  to  10%  of  the  peak  flop 
rate  is  reached  by  the  large  number  of  cores  present  on  today's 
architectures  since  these  cores  most  of  the  time  idle  waiting  for 
data  from  global  memory  or  RAM.  It  is  this  observation  that 
motivated  the  selection  of  the  CPU  as  the  target  hardware  for 
implementing  Chrono::Fluid.  At  roughly  300  CB/s,  the  GPU 
memory  bandwidth  stands  four  to  five  times  higher  than  what 
one  could  expect  on  a  fast  CPU. 

To  describe  the  hardware  organization  of  the  CPU,  we  con¬ 
sider  here  the  NVIDIA  GeForce  CTX  680  [18,  27],  This  CPU  is 
based  on  the  first  generation  of  Kepler  architecture,  code  name 
CK104,  which  is  also  implemented  in  Tesla  K10.  The  Kepler 
architecture  relies  on  a  Graphics  Processing  Cluster  (CPC)  as 
the  defining  high-level  hardware  block.  There  are  a  total  of  4 
CPCs  on  the  CK104.  Each  CPC  includes  two  Stream  Multi¬ 
processors  (SM),  each  of  which  has  192  Scalar  Processors  (SP), 
for  a  total  of  1536  SPs,  and  3.1  TFlops  processing  rate. 

In  addition  to  the  processing  cores,  the  second  important  aspect 
of  CPU  hardware  is  that  of  the  memory  hierarchy.  The  memory 
on  the  CPU  is  divided  into  several  types,  each  with  different 
access  patterns,  latencies,  and  bandwidths: 

Registers  (read/write  per  thread):  65536,  32-bit  memory  units. 
Very  low  latency,  high  bandwidth  (~  10  TB/s  cumulative) 
memory  used  to  hold  thread-local  data. 

Shared  memory/Li  cache  (read/write  per  block):  64  KB.  Low 
latency,  high  bandwidth  (~  1.5-2  TB/s  cumulative)  mem¬ 
ory  divided  between  shared  memory  and  LI  cache. 

Global  memory  (read/write  per  grid):  2  CB.  Used  to  hold  in¬ 
put  and  output  data.  Accessible  by  ail  threads,  with  a 
bandwidth  of  192  GB/s  and  a  higher  latency  (~  400-800 


cycles)  than  shared  memory  and  registers.  All  accesses 
to  global  memory  pass  through  the  L2  cache.  The  lat¬ 
ter  is  512  KB  large  and  has  a  bandwidth  of  512  B/ciock 
cycle. 

Constant  memory  (read  only  per  grid):  48  KB  per  Kepler  SM. 
Used  to  hold  constants,  serviced  at  the  latency  and  band¬ 
width  of  the  LI  cache  upon  a  cache  hit,  or  those  of  the 
global  memory  upon  a  cache  miss. 

The  parallel  execution  paradigm  best  supported  on  GPUs  is 
"single  instruction  multiple  data"  (SIMD).  In  this  model,  if  for 
instance  two  arrays  of  length  one  million  are  to  be  added,  one 
million  threads  are  launched  with  each  executing  the  same  in¬ 
struction;  i.e.,  adding  two  numbers,  but  on  different  data  -  each 
thread  adding  two  different  numbers.  Fortunately,  SIMD  com¬ 
puting  is  very  prevalent  in  the  solution  methodology  proposed, 
where  each  SPH  marker  is  handled  by  a  thread  in  the  same 
way  in  which  hundreds  of  thousands  of  other  threads  handle 
their  markers  using  different  data.  While  strongly  leveraging 
the  SIMD  model  owing  to  the  fine  grain  parallelism  that  it  ex¬ 
poses,  the  methodology  adopted  is  prone  to  lead  to  memory  ac¬ 
cess  patterns  that  do  not  display  high  spatial  and/or  temporal 
locality.  This  is  because  the  SPH  markers  move  in  time  leading 
to  less  structured  memory  accesses  that  adversely  impact  the 
effective  bandwidth  reached  by  the  code. 

3.2.  Time  integration 

Chrono::Fluid  uses  a  second  order  explicit  mid-point  Runge- 
Kutta  (RK2)  scheme  [2]  for  the  time  integration  of  the  fluid  and 
solid  phases,  the  latter  in  its  rigid  or  flexible  representation. 
Algorithm  1  summarizes  the  steps  reguired  for  the  calculation 
of  the  force  on  the  SPH  markers,  rigid  bodies,  and  deformable 
beams  at  time  step  k.  The  variables  Nm,  Nr,  and  N/  denote 
the  number  of  markers,  rigid  bodies,  and  flexible  beams,  respec¬ 
tively.  The  arrays  p,  x,  v,  and  v  store  the  density,  position, 
velocity,  and  modified  velocity  for  ail  markers,  respectively;  for 
example,  p  =  {pa\a  =  0, 1 , 2, ...,  Nm  —  1 }. 

The  external  forces  on  the  rigid  and  flexible  bodies  include  the 
FSI  forces  captured  via  BCE  markers  at  distributed  locations  on 
the  solid  surfaces.  The  distributed  forces  need  to  be  accumu¬ 
lated  into  a  single  force  and  torgue  at  the  center  of  each  rigid 
body,  or  point  forces  at  node  locations  of  each  flexible  body. 
The  reduction  (summation  of  the  forces  and  torgues)  is  han¬ 
dled  by  parallel  scan  operations  available  through  the  Thrust 
library  [11],  which  exposes  a  scan  algorithm  that  scales  linearly. 
The  RK2  integration  scheme  requires  the  calculation  of  the  force 
at  the  beginning  as  well  as  middle  of  the  time  step.  Algorithm  2 
lists  the  steps  required  for  the  time  integration  of  a  typical  FSI 
problem. 
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Algorithm  1  Force  Calculation 

>  Calculate  modified  XSPH  velocities 
1:  for  a  :=  0  to  (Nm  —  1)  do 

2:  vk  =va  (pk,xk,vk) 

3:  end  for 

>  5 PH  forces 

4:  for  a  :=  0  to  (Nm  —  1)  do 
5:  pk  =  pa  [pk  ,\k  ,<ik) 

6:  =  vk 

7: 

8:  end  for 

>  Rigid  body  forces 

9:  for  i  :=  0  to  (Nr  —  1)  do 
10:  V‘=V,(v*) 

11:  Xf  =  Vf 

12:  6i]  =  a>i  (v*1 ,  ) 

13:  qf  =  qf  (wf.qf) 

14:  end  for 

>  ANCF  forces 

15:  for  j  :=  0  to  (Nf  —  1)  do 

16:  Qy  =  Q°  (v*)  —  QJ  (e*)  +  Q?  (e^) 

17:  =  §2 

18:  4^  =  JCI-1 

19:  end  for 


To  improve  the  code  vectorization  and  use  of  fast  memory;  i.e., 
L1/L2  cache,  shared  memory,  and  registers,  each  computation 
task  was  implemented  as  a  sequence  of  light  GPU  kernels.  For 
instance,  different  computation  kernels  are  implemented  to  up¬ 
date  the  attributes  of  the  solid  bodies,  including  force,  moment, 
rotation,  translation,  linear  and  angular  velocity,  and  locations 
of  the  associated  BCE  markers.  A  similar  coding  philosophy 
was  maintained  for  the  density  re-initialization,  boundary  con¬ 
dition  implementation,  and  mapping  of  the  marker  data  on  an 
Eulerian  grid  for  post  processing. 
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Algorithm  2  RK2  Time  Integration 

>  Force  calculation  at  beginning  of  step  (Algorithm  1) 

Calculate  {v^,  pk ,  xk ,  vk ,  V*,  Xk ,  cok ,  q^,  Q^} 

>  Half-step  updates  for  fluid,  rigid  body,  and  flexible  body  states 
for  a  G  {a\a  is  a  fluid  marker}  do 

( jjk+V2  =ipk  +  ^ox  Af/2,  where  ( fja  e  {p„ , x„ , v„ } 
end  for 

for  i  :=  0  to  (Nr  —  1 )  do 

qj/c+l/2  =  qj*  +  qjji  x  Af/2,  where  4%  e  {V,,  X,,  q,} 

end  for 

for  j  \=  0  to  (Nf  —  1 )  do 
ek+V2  =  ek  +  §*  x  At  12 
ek+V2  =  e^  +  Qx  At  12 

end  for 

>  Half-step  update  for  BCE  marker  positions  and  velocities 
for  a  G  {a\a  Is  a  BCE  marker}  do 

Obtain  xk+ and  vk+ according  to  associated  rigid,  flexible, 
or  immersed  boundary  motion. 

end  for 

>  Force  calculation  at  half-step  (Algorithm  1) 

Calculate  {vk+^2,  pk+^2,  xM>2,  vk+^2,  Vk+ '>2,  Xk+ '>2,  cok+'12, 

q/f+1/2  q/t+1/2\ 

>  Full-step  updates  for  fluid,  rigid  body,  and  flexible  body  states 
for  a  G  {a\a  is  a  fluid  marker}  do 

^  =ijjk  +  tt+'l2  X  Af 

end  for 

for  i  :=  0  to  (Nr  —  1 )  do 

qjk+1  =  _|_  qj^+1/2  x  /\f 

end  for 

for  j  :=  0  to  (Nf  —  1 )  do 
e^+1  =  ef  +  §f+1/2  x  At 

§2+1  =  +  §2+1/2  x  Af 

end  for 

>  Full-step  update  for  BCE  marker  positions  and  velocities 
for  a  G  {a\a  is  a  BCE  marker}  do 

Obtain  xk+ 1  and  vj+1. 

end  for 


3.2.1.  Multi-rate  integration 


3.3.  Proximity  calculation  and  neighbor  search 


Stable  Integration  of  the  SPH  fluid  equations  requires  step- 
sizes  which  are  also  appropriate  for  propagating  the  dynamics 
of  any  rigid  solids  In  the  FSI  system.  However,  Integration  of 
the  dynamics  of  deformable  bodies,  especially  as  their  stiffness 
Increases,  calls  for  very  small  time  steps.  To  alleviate  the  associ¬ 
ated  computational  cost,  we  use  a  dual-rate  Integration  scheme 
using  Intermediate  steps  for  the  Integration  of  the  flexible  dy¬ 
namics  EOMs,  typically  with  AtspHl^ANCF  =  10,  although 
stlffer  problems  may  require  ratios  of  up  to  50.  This  aspect  Is 
noteworthy  given  that  typical  FSI  models  involve  a  number  of 
SPH  markers  orders  of  magnitude  larger  than  the  number  of 
ANCF  nodal  coordinates  required  for  the  flexible  bodies.  With¬ 
out  a  multi-rate  Implementation,  the  numerical  solution  would 
visit  the  fluid  phase  at  each  Integration  step,  an  approach  that 
led  to  very  large  solution  times.  This  aspect  Is  further  discussed 
In  Sect.  4. 


In  the  current  Implementation,  each  marker  has  a  list  of  neigh¬ 
bor  markers;  l.e.,  markers  that  fall  within  Its  support  domain. 
Given  this  set  of  lists,  the  calculation  of  vk  =  vD  (pk ,  xk ,  v* ), 
pk  =  pa  (pk ,  rk ,  v*),  and  <jk  =  v„  (pk ,  rk,  vkj  Is  straightforward. 
The  loop  Iterations  In  Algorithms  1  and  2  have  no  overlap  and 
can  be  executed  In  parallel.  The  computational  bottleneck  thus 
becomes  the  determination  of  the  neighbor  lists  through  prox¬ 
imity  calculation,  a  step  that  requires  about  70%  of  the  entire 
computational  budget  and  thus  critically  Impacts  the  overall  per¬ 
formance  of  the  simulation.  Herein,  we  adopt  a  spatial  binning 
algorithm  which  is  sub-optimal  in  terms  of  amount  of  net  work 
but  superior  in  terms  of  memory  usage.  In  this  approach,  the 
computation  domain  is  divided  into  a  collection  of  bins.  The 
bins  have  side  lengths  equal  to  the  maximum  influence  distance 
of  a  marker,  i.e.  kIi.  This  localizes  the  search  for  the  possible 
interacting  markers  to  the  bin  and  all  of  its  26  immediate  3D 
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Figure  3.  Illustration  of  the  spatial  subdivision  method  used  for  proximity  com¬ 
putation  in  2D.  The  circles  represent  the  domain  of  influence  of  each 
marker;  i.e.,  the  support  domain.  For  clarity,  a  coarse  distribution  of 
markers  is  shown.  In  reality,  the  concentration  of  markers  per  bin  is 
much  laryer. 


neighbors.  Figure  3  shows  the  binning  approach  in  2D. 

In  the  neighbor  search  approach  adopted  herein,  neighbor  Lists 
are  not  saved  in  memory;  instead,  neighbors  are  evaluated 
whenever  required.  Alternatively  [28],  the  principle  of  action 
and  reaction  could  be  leveraged  to  calculate  and  store  the  ac¬ 
celeration  terms  for  each  Lnter-penetratlon.  Although  the  second 
algorithm  reduces  the  amount  of  work  by  re-using  the  accel¬ 
eration  terms  calculated  for  each  Lnter-penetratlon,  It  can  not 
be  applied  to  the  SPH  method  due  to  the  massive  amount  of 
memory  required  to  store  the  data  associated  with  all  lnter- 
penetratlon  events.  Another  advantage  of  the  adopted  neighbor 
search  approach  Is  the  coalesced  memory  access  achieved  by 
sorting  and  accessing  the  data  based  on  the  markers  location. 
The  steps  required  for  accessing  the  neighbors  are  summarized 
in  Algorithm  3. 


Algorithm  3  Inner  loop:  accessing  neighbor  markers 

1:  Divide  the  solution  space  into  n/,  bins  of  (Ax,Ay,Az)  dimensions, 
where  n/,  =  nx  x  ny  x  nz,  and  (nx,ny,nz)  is  the  number  of  grid 
cells  along  (x,y ,z)  axis. 

2:  Construct  the  hash  array:  s  =  {s„|o  =  0, 1,2,...,  Nm  —  1 }  according 
to  sa  =  i  x  ny  x  nz  +  j  x  nz  +  k,  where  (i,  j,  k)  is  the  location  of 
the  bin  containing  the  marker  a. 

3:  Sort  s  into  s sorted  and  obtain  corresponding  psarted,  x sorted .  v sorted , 
and  ^sorted- 

4:  Construct  c1  =  {Cp\p  =  0,1,2,...,/?/,  —  1}  and  c2  =  {c2|p  = 
0, 1 , 2, ....  rjj,  —  1 },  where  cy  and  c2  denote  the  two  indices  in  s sorted 
that  bound  the  sequence  of  hash  values  sa  =  p. 

5:  Access  markers  data  in  bin  ( i,j,k )  by  loading  j^Cp.Cpj  portions  of 
the  sorted  arrays  psorted,  ^sorted,  v sorted ,  and  \ sorted,  as  needed. 


The  data  sorting  in  step  3  of  Algorithm  3  is  performed  using  the 
linear-complexity  radix  sort  available  In  the  Thrust  library.  As  a 


Figure  4.  Example  of  Chrono::Fluid  simulation:  channel  flow  over  an  array  of 
flexible  cantilevers.  (See  [32]  for  further  details) 


result,  this  solution  component  does  not  affect  the  overall  linear 
scaling  of  Algorithms  1  and  2.  Working  with  the  sorted  arrays 
for  the  bulk  of  the  computation  has  the  additional  advantage  of 
Increasing  the  memory  spatial  locality:  SPH  markers  that  share 
a  neighborhood  In  the  physical  space,  do  so  In  their  memory 
location  as  well. 

4.  Results 

The  simulation  approach  described  In  Sect.  3  was  Implemented 
to  execute  In  parallel  on  CPU  cards  using  the  CUDA  program¬ 
ming  environment  [26].  All  simulations  reported  In  this  section 
were  run  on  an  NVIDIA  GeForce  CTX  680  GPU  card  described 
In  sub-section  3.1.  In  all  simulations,  If  present,  the  flexible 
beams  were  modeled  using  ne  =  4  ANCF  beam  elements,  while 
the  Integrals  appearing  In  the  elastic  forces  Q e  in  Eq.  (25)  were 
evaluated  using  5  and  3  Gauss  quadrature  points  for  the  axial 
and  bending  elastic  forces,  respectively. 

A  snapshot  from  a  typical  FSI  problem  Involving  flexible  bodies 
is  shown  in  Figure  4.  Additional  Chrono::Fluld  simulation  ex¬ 
amples  can  be  found  at  [32].  Comprehensive  validation  studies 
of  the  Chrono::Fluid  simulation  package  are  provided  in  [31]. 
The  focus  here  is  on  numerical  experiments  that  Illustrate  the 
performance  and  scalability  of  the  proposed  algorithm  and  Its 
parallel  Implementation. 

We  begin  by  analyzing  the  efficiency  and  scaling  attributes  of 
the  solution  for  systems  composed  exclusively  of  rigid  bodies, 
flexible  bodies,  or  a  fluid  phase.  In  each  of  these  three  scenar- 
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Table  1.  Time  required  for  advancing  the  rigid  body  dynamics  simulation  by 
one  time  step  as  function  of  problem  size  (number  of  rigid  bodies,  Nr) 


Nm  =  0,  Nf  =  0 


Nr  (Xl03) 

0.49 

2.87 

16.59 

56.77 

118.23 

f  (ms) 

5 

8 

16 

44 

78 

Table  2.  Time  required  for  advancing  the  flexible  body  dynamics  simulation  by 
one  time  step  as  function  of  problem  size  (number  of  flxible  bodies, 

N,) 


Nm  =  0,  Nr  =  0 


N,  (x  1 03) 

0.78 

3.51 

17.55 

56.94 

115.05 

t  (ms) 

8 

14 

48 

122 

238 

ios,  we  provide  the  simulation  times  per  time  step  for  problems 
of  Increasing  size.  Data  provided  In  Tables  1,  2,  and  3  and  Illus¬ 
trated  In  Figure  5  Indicate  that  updates  of  the  dynamics  of  each 
phase  scale  linearly  with  problem  size;  i.e.,  with  the  number  of 
rigid  bodies,  flexible  bodies,  and  SPH  markers,  respectively. 
While  the  previous  results  show  linear  scaling  for  rigid  and 
flexible  body  dynamics,  in  actual  FSI  problems  in  which  rigid 
and/or  flexible  bodies  interact  with  the  fluid,  the  simulation  time 
is  virtually  independent  of  the  number  of  solid  objects.  The  re¬ 
sults  presented  in  Tables  4  and  5  were  obtained  on  a  system 
consisting  of  approximately  3  million  SPH  markers  by  vary¬ 
ing  the  number  of  rigid  bodies  and  flexible  bodies.  The  small 
sensitivity  of  the  simulation  time  with  respect  to  the  number  of 
solid  objects  is  due  to  the  fact  that  the  number  of  BCE  mark¬ 
ers  associated  with  solid  bodies  represents  only  a  very  small 
fraction  of  the  number  of  SPH  discretization  markers,  the  latter 
overwhelmingly  dictating  the  required  computation  time.  Never¬ 
theless,  as  the  concentration  of  solid  objects  increases,  smaller 
time  steps  are  required  since  the  probability  of  short-range, 
high-frequency  interactions  increases. 

The  two  sets  of  results  provided  in  Table  5  for  different  values 
of  r  =  At$pi-ilAtAHCF  show  a  small  increase  in  the  simulation 
time  when  choosing  a  very  small  integration  step  size  for  flexible 
body  dynamics.  This  illustrates  the  efficiency  of  the  multi-rate 
integration  scheme  in  improving  the  time  integration  stability 


Table  3.  Time  required  for  advancing  a  fluid  dynamics  simulation  by  one  time 
step  as  function  of  problem  size  (number  of  SPH  markers,  Nm) 


Nr  =  0,  Nf  =  0 


Nm  (xIO6) 

0.06 

0.32 

0.93 

1.79 

4.13 

f  (ms) 

27 

121 

331 

538 

1150 

Figure  5.  Scaling  analysis  of  Chrono::Fluid  for  rigid  body  dynamics,  flexible 
body  dynamics,  and  fluid  dynamics  as  function  of  problem  size  (Nr, 
Nf,  and  Nm,  respectively).  The  coefficients  R2  are  specified  to  each 
linear  regression. 


Table  4.  Simulation  time  per  time  step  for  an  FSI  problem  with  fixed  number 
of  SPH  markers  and  increasing  number  of  rigid  bodies 


Nm  ~  3.0  x  106,  Nf  =  0 


Nr 

0 

36 

120 

480 

1800 

8400 

33  600 

t  (ms) 

906 

919 

923 

925 

926 

926 

921 

at  a  small  increase  in  computational  cost.  The  small  changes 
in  the  simulation  times  provided  in  Tables  4  and  5  are  mainly 
due  to  the  deviations  in  the  magnitude  of  Nm  as  the  number  of 
solid  objects  changes.  Linear  scaling  in  Chrono::Fluid  is  also 
demonstrated  in  an  experiment  where  a  combined  FSI  problem, 
i.e.  involving  rigid  and  flexible  bodies  and  including  a  lubri¬ 
cation  force  model  and  two-way  coupling  with  fluid,  is  solved 
on  domains  of  increasing  size  (see  Figure  6).  As  the  simula¬ 
tion  domain  volume  is  increased  by  factors  from  2  up  to  32,  the 
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Table  5.  Simulation  time  per  time  step  for  an  FSI  problem  with  fixed  number 
of  SPH  markers  and  Increasing  number  of  flexible  bodies,  for  two 
different  values  of  the  multi-rate  Integration  factor  r  =  A tspH I^ANCF 


Nm  ~  3.0  x  1 06,  Nr  =  0 


Nf 

0 

45 

140 

440 

1152 

2100 

4704 

T  =  10 

t  (ms) 

906 

923 

928 

916 

960 

950 

921 

t  =  50 

t  (ms) 

906 

973 

978 

965 

1066 

1060 

1060 

Table  6.  Simulation  time  per  time-step  for  combined  FSI  problems  of  Increasing 
size 


Nm 

(xio6) 

0.08 

0.16 

0.29 

0.63 

0.95 

1.54 

2.50 

Nr 

(xIO3) 

0.17 

0.52 

1.12 

4.48 

7.84 

14.56 

24.64 

Nf 

(xIO3) 

0.16 

0.42 

0.84 

2.10 

3.36 

5.88 

9.66 

t 

(ms) 

45 

74 

120 

230 

343 

522 

820 

number  of  SPH  markers  varies  from  about  76,  000  to  more  than 
2.5  million.  Simultaneously,  the  number  of  rigid  and  flexible 
bodies  grow  from  168  to  more  than  24,  000  and  from  160  to  al¬ 
most  10,000,  respectively  (see  Table  6).  As  shown  In  Figure  6, 
Chrono::Fluld  achieves  linear  scaling  over  the  entire  range  of 
problem  sizes. 

5.  Conclusions 

This  contribution  discusses  details  of  an  HPC  approach  to 
the  simulation  of  FSI  problems.  Relying  on:  (1)  a  La- 
granglan/Lagranglan  formalism  for  the  simulation  of  the  fluid 
and  solid  phases,  and  ( LL)  CPU  parallelism,  Chrono::Fluld,  the 
software  Implementation  of  the  proposed  solution  methodology, 
Is  suitable  for  the  study  of  multl-phase/multl-component  prob¬ 
lems  such  as  particle  suspensions,  polymer  flow,  and  biomedical 
applications. 

Chrono::Fluld  has  been  shown  to  scale  linearly  In  the  dynam¬ 
ics  of  each  phase  Independently  as  well  as  In  the  overall  FSI 
solution  Implementation.  The  dynamics  of  the  fluid  phase;  I.e. 
the  time  propagation  of  the  SPH  markers,  controls  the  overall 
simulation  time,  and  neither  cross-phase  communications  nor 
solid  phase  simulation  have  a  notable  effect  on  the  simulation 
time.  This  represents  one  advantage  of  the  adopted  unified  FSI 
methodology  over  co-slmulatlon  and  asynchronous  approaches. 
Chrono::Fluld,  whose  preliminary  validation  Is  discussed  In  [30], 
Is  open  source  and  freely  available  under  a  BSD3  license. 
Several  directions  of  future  work  are  as  Identified  as  follows: 
(a)  revisit  the  SPH  marker  numbering  scheme  In  order  to  Im¬ 
prove  the  spatial  and  temporal  memory  access  patterns,  thus  In¬ 
creasing  the  effective  bandwidth  of  Chrono::Fluld;  ( b )  augment 
the  current  Implementation  for  use  on  cluster  supercomputers 


Figure  6.  Scaling  analysis  of  Chrono::Fluid  for  FSI  problems:  simulation  time 
as  function  of  combined  problem  size.  In  this  experiment,  the  volume 
of  the  simulation  domain  is  increased,  up  to  32  times  the  volume  of 
the  initial  domain,  leading  to  proportional  increases  in  the  number  of 
SPH  markers  and  of  solid  objects  (both  rigid  and  flexible  bodies)  as 
shown  in  the  bottom  plot  (see  also  Table  6).  As  illustrated  by  the  top 
graph,  the  simulation  time  per  time  step  varies  linearly  with  problem 
size.  The  coefficients  R 2  are  specified  for  each  linear  regression. 


that  adopt  a  non-uniform  memory  access  model  and  rely  on  the 
Message  Passing  Interface  standard;  (c)  Investigate  problems 
at  macroscale  such  as  vehicle  fording  operations,  which  call  for 
large  scale  simulations  at  low  Reynolds  numbers  and  possibly 
In  the  presence  of  very  large  viscosity;  and  ( d )  gauge  the  po¬ 
tential  of  Chrono::Fluld  In  biomechanics  applications  such  as 
blood  flow  In  deformable  arteries  or  heart,  channel  occlusion  In 
stroke,  etc. 
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